In the second part of the project, a descriptive analysis was conducted on the stock price data of the considered commodity. As part of the analysis, descriptive statistics of the features were interpreted and visualized.
# import required modules
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score, precision_score, recall_score, f1_score
warnings.filterwarnings('ignore')
sns.set_context("paper", rc={"font.size":13,"axes.titlesize":13,"axes.labelsize":13})
palette = ["#82c2b2", "#33827b", "#344b45", "#63bff0", "#bbbbbb", "#aaaaaa"]
sns.set_palette(palette)
sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pio.renderers.default = 'colab'
seed = 123
The dataset contains observations from 1607 trading days, spanning December 2017 to May 2022. The raw data include the following information about the commodity price at 3 different markets: open, high, low and close (OHLC) prices for each market, plus daily volume and open interest for markets 2 and 3.
Examples of the daily historical stock prices and volumes are shown in the table below.
df = pd.read_csv("data.csv")
df.head()
| | DateTime | Market1: Close | Market1: High | Market1: Low | Market1: Open | Market2: Close | Market2: High | Market2: Low | Market2: Open | Market2: Open_Interest | Market2: Volume | Market3: Close | Market3: High | Market3: Low | Market3: Open | Market3: Open_Interest | Market3: Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-12-14 0:00:00 | 276.600 | 282.700 | 274.400 | 281.400 | 6145 | 6275.500 | 6095.500 | 6255.000 | 301759.000 | 78077.000 | 45250.000 | 45570.000 | 44990.000 | 45500.000 | 704092.000 | 189194.000 |
| 1 | 2017-12-15 0:00:00 | 276.700 | 278.200 | 275.650 | 276.950 | 6145 | 6178.000 | 6114.000 | 6150.000 | 299683.000 | 50088.000 | 44570.000 | 44910.000 | 43980.000 | 44210.000 | 698540.000 | 447986.000 |
| 2 | 2017-12-16 0:00:00 | 275.850 | 277.650 | 274.800 | 276.000 | 6115 | 6151.500 | 6091.500 | 6130.000 | 297582.000 | 53173.000 | 44660.000 | 44840.000 | 44420.000 | 44640.000 | 700286.000 | 317816.000 |
| 3 | 2017-12-17 0:00:00 | 276.950 | 279.400 | 275.350 | 276.400 | 6104 | 6189.000 | 6103.850 | 6126.500 | 300723.000 | 55424.000 | 44740.000 | 44890.000 | 44580.000 | 44650.000 | 695008.000 | 208524.000 |
| 4 | 2017-12-18 0:00:00 | 275.450 | 277.600 | 273.850 | 276.700 | s | 6148.500 | 6073.500 | 6132.000 | 303841.000 | 48765.000 | 44550.000 | 45050.000 | 44250.000 | 45000.000 | 725376.000 | 384066.000 |
df.shape
(1607, 17)
The printout below summarises the number of non-missing values and the datatype of each variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1607 entries, 0 to 1606
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   DateTime                1607 non-null   object
 1   Market1: Close          1555 non-null   float64
 2   Market1: High           1555 non-null   float64
 3   Market1: Low            1555 non-null   float64
 4   Market1: Open           1555 non-null   float64
 5   Market2: Close          1563 non-null   object
 6   Market2: High           1563 non-null   float64
 7   Market2: Low            1563 non-null   float64
 8   Market2: Open           1563 non-null   float64
 9   Market2: Open_Interest  1548 non-null   float64
 10  Market2: Volume         1563 non-null   float64
 11  Market3: Close          1503 non-null   float64
 12  Market3: High           1503 non-null   float64
 13  Market3: Low            1503 non-null   float64
 14  Market3: Open           1502 non-null   float64
 15  Market3: Open_Interest  1503 non-null   float64
 16  Market3: Volume         1503 non-null   float64
dtypes: float64(15), object(2)
memory usage: 213.6+ KB
orig_colnames = df.columns
cols = ['date', 'close_1', 'high_1', 'low_1', 'open_1',
'close_2', 'high_2', 'low_2', 'open_2', 'open_interest_2', 'volume_2',
'close_3', 'high_3', 'low_3', 'open_3', 'open_interest_3', 'volume_3']
pretty_cols = ['Date', 'Market1: Close', 'Market1: High', 'Market1: Low', 'Market1: Open',
'Market2: Close', 'Market2: High', 'Market2: Low', 'Market2: Open', 'Market2: Open Interest', 'Market2: Volume',
'Market3: Close', 'Market3: High', 'Market3: Low', 'Market3: Open', 'Market3: Open Interest', 'Market3: Volume']
cols_dict = {cols[i]: pretty_cols[i] for i in range(len(cols))}
df.columns = cols
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='date', inplace=True)
The column with market 2 closing prices contains values which are not numeric. All such values were treated as missing and replaced with NaN.
df["close_2"] = df["close_2"].replace("^[^.0-9]*$", np.NaN, regex=True).astype(np.float64)
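An equivalent way to flag such entries, sketched here on a hypothetical mini-series rather than the project data, is `pd.to_numeric` with `errors='coerce'`:

```python
import pandas as pd

# Hypothetical mini-series mimicking 'close_2': numeric strings with a
# stray non-numeric entry ('s') standing in for a missing value.
raw = pd.Series(["6145", "6145", "6115", "s", "6104"])

# errors='coerce' turns anything unparseable into NaN, achieving the
# same effect as the regex replacement above.
clean = pd.to_numeric(raw, errors="coerce")
print(clean.isna().sum())  # 1
```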
One can notice that the dataset contains some missing values, which are handled in the next section.
The table below shows descriptive statistics of the OHLC prices, volumes and open interests of the investigated markets. It can be noticed that market 1 prices are significantly lower than those of the remaining markets, while market 3 prices are the highest. Additionally, looking at the max and min values, one can notice some extreme values for the market 1 high and market 3 low prices, which may be considered potential outliers.
round(df.describe().T, 3)
| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| date | 1607 | 2020-02-24 23:59:59.999999744 | 2017-12-14 00:00:00 | 2019-01-19 12:00:00 | 2020-02-25 00:00:00 | 2021-04-01 12:00:00 | 2022-05-08 00:00:00 | NaN |
| close_1 | 1555.000 | 269.115 | 194.350 | 246.900 | 267.350 | 292.025 | 429.450 | 38.452 |
| high_1 | 1555.000 | 433.756 | 0.550 | 249.850 | 269.500 | 294.325 | 252743.000 | 6402.588 |
| low_1 | 1555.000 | 266.502 | 193.550 | 243.025 | 265.050 | 289.650 | 419.550 | 37.998 |
| open_1 | 1555.000 | 269.005 | 193.850 | 246.350 | 267.350 | 292.050 | 430.800 | 38.303 |
| close_2 | 1562.000 | 5958.972 | 4331.000 | 5462.250 | 5926.000 | 6462.750 | 9412.500 | 849.033 |
| high_2 | 1563.000 | 6013.688 | 4394.000 | 5530.750 | 5966.000 | 6505.000 | 9617.000 | 855.094 |
| low_2 | 1563.000 | 5904.792 | 4318.000 | 5378.250 | 5873.500 | 6403.500 | 9315.000 | 841.036 |
| open_2 | 1563.000 | 5959.325 | 4332.500 | 5449.000 | 5923.500 | 6462.500 | 9450.000 | 847.142 |
| open_interest_2 | 1548.000 | 308290.313 | 249719.000 | 290599.750 | 306666.500 | 325907.750 | 378509.000 | 24395.852 |
| volume_2 | 1563.000 | 45301.463 | 7805.000 | 34231.500 | 42350.000 | 53626.000 | 192388.000 | 17802.841 |
| close_3 | 1503.000 | 46306.261 | 33530.000 | 41740.000 | 47120.000 | 50340.000 | 69380.000 | 6207.132 |
| high_3 | 1503.000 | 46636.487 | 34020.000 | 42020.000 | 47360.000 | 50720.000 | 70440.000 | 6235.302 |
| low_3 | 1503.000 | 277958.756 | 33150.000 | 41430.000 | 46880.000 | 50030.000 | 348700000.000 | 8993221.310 |
| open_3 | 1502.000 | 46277.370 | 33530.000 | 41800.000 | 47115.000 | 50330.000 | 70080.000 | 6185.590 |
| open_interest_3 | 1503.000 | 581329.937 | 269483.000 | 505479.000 | 591236.000 | 704982.000 | 952018.000 | 156643.091 |
| volume_3 | 1503.000 | 168440.959 | 4.000 | 64590.500 | 113508.000 | 218862.000 | 1319384.000 | 157557.011 |
A stock’s closing price is the value most commonly used by investors to track performance over time. The plots below show both the opening and closing price of the commodity at each of the 3 markets. It is easy to notice that the price movements at all markets were driven by the same forces, as they have very similar trends.
markets_no = 3
metrics_no = 2
metrics_names = ['open_', 'close_']
fig, axs = plt.subplots(markets_no, metrics_no, figsize=(25, 10))
for i in range(markets_no):
for j in range(metrics_no):
sns.lineplot(data=df, x='date', y=(metrics_names[j] + str(i+1)), ax=axs[i, j])
axs[i, j].set(ylabel=f'Market {i+1}: {metrics_names[j][:-1].capitalize()}',
xlabel='')
plt.show()
The plots below show how the volume, open interest and their 10-day moving averages for the 2nd and 3rd markets have been changing over time. The variance of the volume at both markets decreases over time. Furthermore, volumes recorded at the two markets are moderately correlated, with a Pearson coefficient of 0.447.
fig, axs = plt.subplots(2, 2, figsize=(30, 10))
for i in range(2):
sns.lineplot(data=df, x='date', y=f'volume_{i+2}', ax=axs[i, 0], linewidth = 0.6)
sns.lineplot(x=df['date'], y=df[f'volume_{i+2}'].rolling(10).mean(), ax=axs[i, 0], linewidth = 0.8, alpha=.8)
axs[i, 0].set(ylabel=f'Market {i+2}: Volume', xlabel='')
sns.lineplot(data=df, x='date', y=f'open_interest_{i+2}', ax=axs[i, 1], linewidth = 0.8)
sns.lineplot(x=df['date'], y=df[f'open_interest_{i+2}'].rolling(10).mean(), ax=axs[i, 1], linewidth = 0.9, alpha=.8)
axs[i, 1].set(ylabel=f'Market {i+2}: Open Interest', xlabel='')
plt.show()
df[['open_interest_2', 'volume_2', 'open_interest_3', 'volume_3']].rename(columns=cols_dict).corr()
| | Market2: Open Interest | Market2: Volume | Market3: Open Interest | Market3: Volume |
|---|---|---|---|---|
| Market2: Open Interest | 1.000 | 0.154 | 0.244 | 0.304 |
| Market2: Volume | 0.154 | 1.000 | 0.348 | 0.447 |
| Market3: Open Interest | 0.244 | 0.348 | 1.000 | 0.436 |
| Market3: Volume | 0.304 | 0.447 | 0.436 | 1.000 |
The boxplots below present the distribution of each attribute in the dataset. Despite the fact that most values seem roughly homogeneous, the market 1 high price and market 3 low price contain some extreme observations which are likely measurement errors. To verify this, each outlier together with its 2 nearest past and future observations was investigated.
metrics_no = 4
metrics_names = ['open_', 'high_', 'low_', 'close_']
fig, axs = plt.subplots(markets_no, metrics_no, figsize=(25, 15))
for i in range(markets_no):
for j in range(metrics_no):
sns.boxplot(data=df, y=(metrics_names[j] + str(i+1)), ax=axs[i, j], boxprops=dict(alpha=0.7))
axs[i, j].set(ylabel=f'Market {i + 1}: {metrics_names[j][:-1].capitalize()}', xlabel='')
plt.show()
After checking the outlier values for the highest price at the 1st market and the lowest price at the 3rd market, it is easy to notice that the numbers look similar to their neighbours but are larger by several orders of magnitude. Because of that, both values were divided accordingly, to bring them to the same scale as the remaining values in the column.
n = 2
idx = df[(df['high_1'] > 5000)].index
df.iloc[(idx[0] - n):(idx[0] + n + 1)]['high_1']
605    265.050
606    261.750
607    252743.000
608    252.700
609    253.500
Name: high_1, dtype: float64
df['high_1'] = np.where(df['high_1'] > 50000, df['high_1'] / 1000, df['high_1'])
idx = df[(df['low_3'] > 500000)].index
df.iloc[(idx[0] - n) : (idx[0] + n + 1)]['low_3']
241    34580.000
242    34480.000
243    348700000.000
244    34880.000
245    35730.000
Name: low_3, dtype: float64
df['low_3'] = np.where(df['low_3'] > 500000, df['low_3'] / 10000, df['low_3'])
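A generic heuristic for spotting such magnitude errors, sketched here on a toy series rather than the original pipeline, is to compare each value with the rolling median of its neighbourhood:

```python
import pandas as pd

# Toy series with one value recorded at roughly 1000x its true scale.
s = pd.Series([265.0, 261.8, 252743.0, 252.7, 253.5])

# Flag points that exceed 100x the rolling median of their neighbourhood,
# a simple heuristic for misplaced decimal points.
ratio = s / s.rolling(3, center=True, min_periods=1).median()
outliers = ratio > 100
print(outliers.tolist())  # [False, False, True, False, False]
```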
After handling the extremely high values, it was noticed that the distribution of the 1st market's high prices contains one more abnormally low value. After checking it, no pattern was found, so the value was considered missing and replaced with NaN.
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=df, y="high_1", ax=axs[0], boxprops=dict(alpha=0.7))
axs[0].set(ylabel='Market 1: High', xlabel='')
sns.boxplot(data=df, y="low_3", ax=axs[1], boxprops=dict(alpha=0.7))
axs[1].set(ylabel='Market 3: Low', xlabel='')
plt.show()
idx = df[(df['high_1'] < 100)].index
df.iloc[(idx[0] - n) : (idx[0] + n + 1)]['high_1']
382    214.450
383    217.000
384    0.550
385    213.900
386    218.350
Name: high_1, dtype: float64
df['high_1'] = np.where(df['high_1'] < 100, np.NaN, df['high_1'])
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=df, y="high_1", ax=axs[0], boxprops=dict(alpha=0.7))
axs[0].set(ylabel='Market 1: High', xlabel='')
sns.boxplot(data=df, y="low_3", ax=axs[1], boxprops=dict(alpha=0.7))
axs[1].set(ylabel='Market 3: Low', xlabel='')
plt.show()
The analysed dataset contains many missing values, which were imputed using the previous observation (forward fill). This is a straightforward and commonly used method of imputing missing values in time series, in line with the assumption that future values of the series are unknown. Example entries before and after imputation are displayed below.
df[['close_2', 'high_2', 'low_2', 'open_2']].iloc[251:258]
| | close_2 | high_2 | low_2 | open_2 |
|---|---|---|---|---|
| 251 | 4665.500 | 4745.000 | 4665.350 | 4745.000 |
| 252 | 4720.000 | 4720.150 | 4654.500 | 4678.000 |
| 253 | 4692.000 | 4721.000 | 4672.500 | 4718.000 |
| 254 | NaN | NaN | NaN | NaN |
| 255 | NaN | NaN | NaN | NaN |
| 256 | 4729.000 | 4750.000 | 4617.000 | 4639.000 |
| 257 | 4735.000 | 4746.000 | 4700.000 | 4726.000 |
df = df.ffill()
df[['close_2', 'high_2', 'low_2', 'open_2']].iloc[251:258]
| | close_2 | high_2 | low_2 | open_2 |
|---|---|---|---|---|
| 251 | 4665.500 | 4745.000 | 4665.350 | 4745.000 |
| 252 | 4720.000 | 4720.150 | 4654.500 | 4678.000 |
| 253 | 4692.000 | 4721.000 | 4672.500 | 4718.000 |
| 254 | 4692.000 | 4721.000 | 4672.500 | 4718.000 |
| 255 | 4692.000 | 4721.000 | 4672.500 | 4718.000 |
| 256 | 4729.000 | 4750.000 | 4617.000 | 4639.000 |
| 257 | 4735.000 | 4746.000 | 4700.000 | 4726.000 |
df.isna().sum().sum()
0
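The mechanics of forward filling can be illustrated on a toy series: each gap is filled with the last observed value, never a future one.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# ffill propagates the last known value forward in time,
# so the imputation never uses future information.
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```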
Machine learning models can be trained to predict stock movements from historical market data, including OHLC prices and volumes. Another common approach is to use technical indicators, which may help to infer the behaviour or trends of a time series. In the following sections, some popular technical indicators are calculated and visualized to demonstrate how the stock performance has been changing over time.
As the aim is to predict the direction of the daily change of the stock open price, the target variable was encoded as "0" or "1", where "0" means that the next day’s open price is lower than today’s open price, and "1" means that it went up. It can be seen that the total counts of upward and downward movements of the opening price are very similar.
df['outcome'] = (df['open_2'][1:].values > df['open_2'][:-1]).astype(int)
df[['open_2', 'outcome']].head(6)
| | open_2 | outcome |
|---|---|---|
| 0 | 6255.000 | 0.000 |
| 1 | 6150.000 | 0.000 |
| 2 | 6130.000 | 0.000 |
| 3 | 6126.500 | 1.000 |
| 4 | 6132.000 | 0.000 |
| 5 | 6096.000 | 0.000 |
df["outcome"] = df["outcome"].map({1: "Up", 0: "Down"})
plt.figure(figsize=(7,6))
ax = sns.countplot(x='outcome', hue='outcome', data=df, order = df['outcome'].value_counts().index)
ax.set(xlabel="Market 2: open to open price movements", ylabel="Count")
ax.bar_label(ax.containers[0])
ax.get_legend().set_visible(False)
plt.show()
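The same target can be constructed with `shift`, which makes the definition explicit; the toy series below reuses the first five market 2 open prices from the table above.

```python
import pandas as pd

# First five open prices of market 2 from the table above.
open_prices = pd.Series([6255.0, 6150.0, 6130.0, 6126.5, 6132.0])

# outcome[t] = 1 iff the next day's open is higher than today's open.
outcome = (open_prices.shift(-1) > open_prices).astype(int)
print(outcome.tolist())  # [0, 0, 0, 1, 0]
```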
In the section below, the daily returns of the closing and opening prices were calculated and plotted. One can notice that periods of high volatility, when the closing prices changed sharply, are usually common to all markets. As the closing price of a given day is not yet known at that day's open, the daily returns were lagged to avoid data leakage.
def log_daily_return(v):
return np.log(v[1:] / v[:-1].values)
for i in range(markets_no):
df[f'return_op_{i+1}'] = log_daily_return(df[f'open_{i+1}'])
df[f'return_cl_{i+1}'] = log_daily_return(df[f'close_{i+1}'])
fig, axs = plt.subplots(markets_no, 1, figsize=(25, 12))
for i in range(markets_no):
sns.lineplot(data=df, x='date', y=f'return_cl_{i+1}', ax=axs[i], linewidth = 0.6, color=palette[0])
sns.lineplot(data=df, x='date', y=f'return_op_{i+1}', ax=axs[i], linewidth = 0.6, color=palette[1])
axs[i].set(ylabel=f'Market {i+1}: Daily return', xlabel='')
plt.show()
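A useful property of the log returns computed above, illustrated on a toy price series, is that they are additive across periods:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])
log_ret = np.log(prices / prices.shift(1)).dropna()

# Log returns add across periods: the daily returns sum to the
# log return over the whole window.
total = np.log(prices.iloc[-1] / prices.iloc[0])
print(np.isclose(log_ret.sum(), total))  # True
```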
for i in range(markets_no):
df[f'return_cl_{i+1}_lag_1'] = df[f'return_cl_{i+1}'].shift(1)
df[f'return_cl_{i+1}_lag_2'] = df[f'return_cl_{i+1}'].shift(2)
cols_dict[f'return_cl_{i+1}_lag_1'] = f'M{i+1}: Daily return close lag(1)'
cols_dict[f'return_cl_{i+1}_lag_2'] = f'M{i+1}: Daily return close lag(2)'
df[f'return_op_{i+1}_lag_1'] = df[f'return_op_{i+1}'].shift(1)
df[f'return_op_{i+1}_lag_2'] = df[f'return_op_{i+1}'].shift(2)
cols_dict[f'return_op_{i+1}_lag_1'] = f'M{i+1}: Daily return open lag(1)'
cols_dict[f'return_op_{i+1}_lag_2'] = f'M{i+1}: Daily return open lag(2)'
for i in range(markets_no):
df.drop(columns=[f'return_cl_{i+1}'], inplace=True)
df.drop(columns=[f'return_op_{i+1}'], inplace=True)
The histograms below show that the distribution of daily returns at the 1st and 2nd markets differs from the distribution at the 3rd market. At the 3rd market, price changes are more stable and concentrated around the mean, which indicates that the distribution is leptokurtic. Days when the stock opening price went up were sometimes preceded by a day with a positive daily return. In other words, the more the close-to-close price increased the day before, the higher the probability that a movement in the same direction occurs again. Nevertheless, this pattern is barely visible, and most often the previous day's daily return does not determine the direction of the next day's opening price movement. Because some processes may affect others only after some time, 1- and 2-day lags of the daily returns were included in the dataset.
fig, axs = plt.subplots(1, markets_no, figsize=(20, 7))
for i in range(markets_no):
sns.histplot(data=df, x=df[f'return_op_{i+1}_lag_1'], kde=True, hue='outcome', ax=axs[i], binwidth=0.003)
axs[i].set(xlabel=f'Market {i+1}: Daily return open lagged by 1 day')
axs[i].legend(title='Price movements', loc='upper right', labels=['Up', 'Down'])
axs[i].set(ylim=(0, 200))
plt.show()
fig, axs = plt.subplots(1, markets_no, figsize=(20, 7))
for i in range(markets_no):
sns.histplot(data=df, x=df[f'return_cl_{i+1}_lag_1'], kde=True, hue='outcome', ax=axs[i], binwidth=0.003)
axs[i].set(xlabel=f'Market {i+1}: Daily return close lagged by 1 day')
axs[i].legend(title='Price movements', loc='upper right', labels=['Up', 'Down'])
axs[i].set(ylim=(0, 200))
plt.show()
The plot below provides additional information about the relationship between the price changes of all stocks and their 1-day lags. Closing price changes of markets 1 and 2 are almost uncorrelated within the same period. No relationship can be observed between the daily returns of market 3 and the two remaining markets on the same day; however, the daily returns of market 3 are slightly more correlated with the 1-day lags of markets 1 and 2.
plot_df = df[['return_cl_1_lag_1', 'return_cl_2_lag_1', 'return_cl_3_lag_1', 'return_cl_1_lag_2', 'return_cl_2_lag_2', 'return_cl_3_lag_2', 'outcome']].copy().rename(cols_dict, axis=1)
g = sns.pairplot(plot_df, hue='outcome', height=4)
g._legend.set_title('Movement')
In general, technical indicators are analysed and plotted to help investors make buy and sell decisions by providing information about a security’s price trends and performance. In this analysis, technical indicators are used as predictors when training the price movement classifier.
The simple moving average indicates the average trend of a variable over a specific time period. It smooths out abnormal signals and is calculated as the average of the last n closing prices.
def MA(v, period):
return v.rolling(period).mean()
ma_period = [1, 5, 10, 20, 40]
for i in range(markets_no):
for period in ma_period:
df[f'ma_{i+1}_{period}'] = MA(df[f'close_{i+1}'], period).shift(1)
The exponential moving average gives higher weights to the latest data points, which helps it respond to price changes faster than a simple MA.
def EMA(v, period):
ema = [sum(v[:period]) / period]
for value in v[period:]:
ema.append((value * (2 / (1 + period))) + ema[-1] * (1 - (2 / (1 + period))))
return [np.nan] * (period - 1) + ema
for i in range(markets_no):
for period in ma_period:
df[f'ema_{i+1}_{period}'] = pd.Series(EMA(df[f'close_{i+1}'], period)).shift(1)
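As a sanity check on a toy series, the recursion above matches pandas' built-in `ewm`; the two differ only in the seed value, since `ewm(span=n, adjust=False)` starts from the first observation while the function above seeds with an initial simple average.

```python
import numpy as np
import pandas as pd

v = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
span = 3
alpha = 2 / (span + 1)

# pandas uses the same recursion ema[t] = alpha*v[t] + (1-alpha)*ema[t-1],
# but seeds it with the first observation instead of an initial SMA.
ema = v.ewm(span=span, adjust=False).mean()
manual = [v.iloc[0]]
for x in v.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])
print(np.allclose(ema, manual))  # True
```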
The candlestick charts below show the OHLC price data and the 2-week SMA and EMA of the commodity at the 1st and 2nd markets over the first two months of 2022. Price changes are very similar on both markets; there were hardly any days when the direction of the open-to-close price change on the same date differed between the markets. One can also observe that the SMA and EMA are mostly close to each other but start to deviate during periods when the price trend changes direction.
d1 = datetime(2021, 12, 30)
d2 = datetime(2022, 2, 28)
candle_df = df[(df['date'] > d1) & (df['date'] < d2)]
prices = candle_df[['date', 'open_2', 'high_2', 'low_2', 'close_2']]
prices.columns = ['date', 'open', 'high', 'low', 'close']
up = prices[prices.close>=prices.open]
down = prices[prices.close<prices.open]
width = .6
width2 = .06
col1 = 'black'
col2 = 'teal'
fig, ax = plt.subplots(1, 1, figsize=(25, 7))
plt.bar(up['date'], up['close']-up['open'], width, bottom=up['open'], color=col1, alpha=.7)
plt.bar(up['date'], up['high']-up['close'], width2, bottom=up['close'], color=col1, alpha=.7)
plt.bar(up['date'], up['low']-up['open'], width2, bottom=up['open'], color=col1, alpha=.7)
plt.bar(down['date'],down['close']-down['open'], width, bottom=down['open'], color=col2, alpha=.7)
plt.bar(down['date'],down['high']-down['open'], width2, bottom=down['open'], color=col2, alpha=.7)
plt.bar(down['date'],down['low']-down['close'], width2, bottom=down['close'], color=col2, alpha=.7)
plt.plot(prices['date'], MA(prices['close'], 10), color=col1, alpha=.5)
plt.plot(prices['date'], EMA(prices['close'], 10), color=col2, alpha=.7)
ax.set_xlabel('Date')
ax.set_ylabel('')
fig.suptitle('Market Commodity prices: Jan-Feb 2022', size=14)
plt.show()
prices = candle_df[['date', 'open_1', 'high_1', 'low_1', 'close_1']]
prices.columns = ['date', 'open', 'high', 'low', 'close']
up = prices[prices.close>=prices.open]
down = prices[prices.close<prices.open]
fig, ax = plt.subplots(1, 1, figsize=(25, 7))
plt.bar(up['date'], up['close']-up['open'], width, bottom=up['open'], color=col1, alpha=.7)
plt.bar(up['date'], up['high']-up['close'], width2, bottom=up['close'], color=col1, alpha=.7)
plt.bar(up['date'], up['low']-up['open'], width2, bottom=up['open'], color=col1, alpha=.7)
plt.bar(down['date'],down['close']-down['open'], width, bottom=down['open'], color=col2, alpha=.7)
plt.bar(down['date'],down['high']-down['open'], width2, bottom=down['open'], color=col2, alpha=.7)
plt.bar(down['date'],down['low']-down['close'], width2, bottom=down['close'], color=col2, alpha=.7)
plt.plot(prices['date'], MA(prices['close'], 10), color=col1, alpha=.5)
plt.plot(prices['date'], EMA(prices['close'], 10), color=col2, alpha=.7)
ax.set_xlabel('Date')
ax.set_ylabel('')
fig.suptitle('Market 1: Commodity prices: Jan-Feb 2022', size=14)
plt.show()
RSI indicates whether an asset is strong and provides signals telling investors to buy when the security is oversold and to sell when it is overbought. It is given by the formula:

$$RSI = 100 - \frac{100}{1 + RS}$$

where $RS$ is the ratio of the average gain on days the close moved higher to the average loss on days the close moved lower. Stocks which have had more or stronger positive changes have a higher RSI than stocks which have had more or stronger negative changes.
def RSI(v):
v_diff = v.diff()
v_diff.dropna(inplace=True)
v_diff_up, v_diff_down = v_diff.copy(), v_diff.copy()
v_diff_up[v_diff_up < 0] = 0
v_diff_down[v_diff_down > 0] = 0
avg_up = v_diff_up.rolling(14).mean()
avg_down = v_diff_down.rolling(14).mean().abs()
return avg_up / (avg_up + avg_down) * 100
for i in range(markets_no):
    df[f'rsi_{i+1}'] = RSI(df[f'close_{i+1}']).shift(1)
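The behaviour of the RSI can be checked on a toy series; the helper below is a compact re-expression of the same rolling-average logic using `clip`. On a strictly rising series every change is a gain, so the index saturates at 100.

```python
import pandas as pd

def rsi(v, window=14):
    diff = v.diff().dropna()
    up = diff.clip(lower=0).rolling(window).mean()
    down = diff.clip(upper=0).abs().rolling(window).mean()
    return up / (up + down) * 100

# On a strictly rising series every change is a gain, so after the
# 14-day warm-up the RSI sits at its upper bound of 100.
prices = pd.Series(range(1, 21), dtype=float)
print(rsi(prices).dropna().unique())  # [100.]
```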
The moving average convergence/divergence indicator shows the difference between the 26-period EMA and the 12-period EMA. It provides two pieces of information: the trend direction of the stock price and the momentum of the share price.
def MACD(v):
return np.subtract(EMA(v, 26), EMA(v, 12))
for i in range(markets_no):
df[f'macd_{i+1}'] = pd.Series(MACD(df[f'close_{i+1}'])).shift(1)
The plots below show the closing prices of the commodity at all 3 markets together with two technical indicators: the relative strength index and the moving average convergence/divergence.
candle_df = df[(df['date'] > d1) & (df['date'] < d2)]
fig, axs = plt.subplots(3, markets_no, figsize=(30, 10))
for i in range(markets_no):
sns.lineplot(data=candle_df, x='date', y=f'close_{i+1}', ax=axs[0, i])
axs[0, i].set(ylabel=f'M{i+1}: Close', xlabel='')
sns.lineplot(data=candle_df, x='date', y=f'rsi_{i+1}', ax=axs[1, i])
axs[1, i].set(ylabel=f'M{i+1}: RSI', xlabel='')
sns.lineplot(data=candle_df, x='date', y=f'macd_{i+1}', ax=axs[2, i])
axs[2, i].set(ylabel=f'M{i+1}: MA convergence/divergence', xlabel='')
plt.show()
df["outcome"] = df["outcome"].map({"Up": 1, "Down": 0})
x, y = [], []
for i in range(3):
for j in ma_period:
x.append(f'ma_{i+1}_{j}')
y.append(f'ema_{i+1}_{j}')
cols_dict[f'ma_{i+1}_{j}'] = f'M{i+1}: {j}-day Simple MA'
cols_dict[f'ema_{i+1}_{j}'] = f'M{i+1}: {j}-day Exponential MA'
df_corr_cols = ['outcome',
'return_cl_1_lag_1', 'return_cl_1_lag_2', 'return_op_1_lag_1', 'return_op_1_lag_2',
'return_cl_2_lag_1', 'return_cl_2_lag_2', 'return_op_2_lag_1', 'return_op_2_lag_2',
'return_cl_3_lag_1', 'return_cl_3_lag_2', 'return_op_3_lag_1', 'return_op_3_lag_2',
'rsi_1', 'rsi_2', 'rsi_3', 'macd_1', 'macd_2',
'macd_3']
pretty_cols = ['Outcome',
'M1: Daily returns close lag 1', 'M1: Daily returns close lag 2', 'M1: Daily returns open lag 1', 'M1: Daily returns open lag 2',
'M2: Daily returns close lag 1', 'M2: Daily returns close lag 2', 'M2: Daily returns open lag 1', 'M2: Daily returns open lag 2',
'M3: Daily returns close lag 1', 'M3: Daily returns close lag 2', 'M3: Daily returns open lag 1', 'M3: Daily returns open lag 2',
'M1: RSI', 'M2: RSI', 'M3: RSI',
'M1: MA Conv/div', 'M2: MA Conv/div', 'M3: MA Conv/div',]
for i in range(len(df_corr_cols)):
cols_dict[df_corr_cols[i]] = pretty_cols[i]
The closing, minimum and maximum prices were also shifted by 1 day, assuming that only the opening price is known on a particular day.
cols_to_shift = ['close_1', 'high_1', 'low_1',
'close_2', 'high_2', 'low_2',
'open_interest_2', 'volume_2', 'close_3',
'high_3', 'low_3', 'open_interest_3', 'volume_3']
df[cols_to_shift] = df[cols_to_shift].shift(1)
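The effect of the shift on a toy frame: after shifting, row t holds day t-1's value of each shifted column, mirroring what a trader actually knows at day t's open.

```python
import pandas as pd

df_toy = pd.DataFrame({"open": [10.0, 11.0, 12.0],
                       "close": [10.5, 11.5, 12.5]})

# After the shift, row t holds the close of day t-1, so no same-day
# closing information leaks into the features.
df_toy["close"] = df_toy["close"].shift(1)
print(df_toy["close"].tolist())  # [nan, 10.5, 11.5]
```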
The heatmap below describes the degree of linear relationship between each pair of features selected from the dataset. The highest correlation with the target variable occurs for the daily returns, but even it is close to 0; the other variables are even less correlated with the price movements. Additionally, the daily returns of the 1st and 2nd markets are correlated with each other and with the 1-day lag of the 3rd market returns. The RSI indices are strongly negatively correlated with the MACD indicators. The simple MA and EMA values of all markets are also correlated with each other.
df_corr_cols = ['outcome', 'close_1', 'close_2', 'open_interest_2', 'volume_2',
'close_3', 'high_3', 'low_3', 'open_interest_3', 'volume_3',
'return_cl_1_lag_1', 'return_cl_1_lag_2', 'return_op_1_lag_1', 'return_op_1_lag_2',
'return_cl_2_lag_1', 'return_cl_2_lag_2', 'return_cl_3_lag_1', 'return_cl_3_lag_2',
'ma_1_10', 'ma_2_10', 'ma_3_10',
'rsi_1', 'rsi_2', 'rsi_3',
'macd_1', 'macd_2', 'macd_3']
df_corr = df[df_corr_cols].copy().rename(cols_dict, axis=1)
plt.figure(figsize=(25, 20))
mask = np.triu(np.ones_like(df_corr.corr(), dtype=bool))
heatmap = sns.heatmap(df_corr.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='vlag', annot_kws={"size": 13})
In the following section, the dataset was split into a train and a validation set, where the validation set consists of the latest 50 trading days. Depending on the type of technical indicator, the first values of these attributes are missing, so the affected rows were removed from the dataset. Only some of the available features were used for model training.
df.set_index('date', inplace=True)
df.columns
Index(['close_1', 'high_1', 'low_1', 'open_1', 'close_2', 'high_2', 'low_2',
'open_2', 'open_interest_2', 'volume_2', 'close_3', 'high_3', 'low_3',
'open_3', 'open_interest_3', 'volume_3', 'outcome', 'return_cl_1_lag_1',
'return_cl_1_lag_2', 'return_op_1_lag_1', 'return_op_1_lag_2',
'return_cl_2_lag_1', 'return_cl_2_lag_2', 'return_op_2_lag_1',
'return_op_2_lag_2', 'return_cl_3_lag_1', 'return_cl_3_lag_2',
'return_op_3_lag_1', 'return_op_3_lag_2', 'ma_1_1', 'ma_1_5', 'ma_1_10',
'ma_1_20', 'ma_1_40', 'ma_2_1', 'ma_2_5', 'ma_2_10', 'ma_2_20',
'ma_2_40', 'ma_3_1', 'ma_3_5', 'ma_3_10', 'ma_3_20', 'ma_3_40',
'ema_1_1', 'ema_1_5', 'ema_1_10', 'ema_1_20', 'ema_1_40', 'ema_2_1',
'ema_2_5', 'ema_2_10', 'ema_2_20', 'ema_2_40', 'ema_3_1', 'ema_3_5',
'ema_3_10', 'ema_3_20', 'ema_3_40', 'rsi_1', 'rsi_2', 'rsi_3', 'macd_1',
'macd_2', 'macd_3'],
dtype='object')
df_model = df[['open_interest_3', 'volume_3', 'outcome', 'return_cl_1_lag_1',
'return_cl_1_lag_2', 'return_op_1_lag_1',
'return_cl_2_lag_1', 'return_cl_2_lag_2', 'return_op_2_lag_1',
'return_op_2_lag_2', 'return_cl_3_lag_1',
'return_op_3_lag_1', 'return_op_3_lag_2', 'ma_1_1', 'ma_1_5', 'ma_1_10',
'ma_1_20', 'ma_1_40', 'ma_2_1', 'ma_2_5', 'ma_2_10', 'ma_2_20',
'ma_2_40', 'ma_3_1', 'ma_3_5', 'ma_3_10', 'ma_3_20', 'ma_3_40',
'ema_1_1', 'ema_1_5', 'ema_1_10', 'ema_1_20', 'ema_1_40', 'ema_2_1',
'ema_2_5', 'ema_2_10', 'ema_2_20', 'ema_2_40', 'ema_3_1', 'ema_3_5',
'ema_3_10', 'ema_3_20', 'ema_3_40', 'rsi_1', 'rsi_2', 'rsi_3', 'macd_1',
'macd_2', 'macd_3']].copy()
df_model.isna().any().sum()
49
df_model = df_model.loc[df_model.index[~df_model.isnull().any(axis=1)]]
# train test split
x_cols = df_model.columns.tolist()
x_cols.remove('outcome')
y, X = df_model['outcome'], df_model[x_cols]
X_train, X_test, y_train, y_test = X.iloc[:-50], X.iloc[-50:], y.iloc[:-50], y.iloc[-50:]
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((1516, 48), (1516,), (50, 48), (50,))
X_train.isna().sum().sum()
0
X_test.isna().sum().sum()
0
In this part, the train and test sets were standardized using the mean and standard deviation calculated from the training set. Standardization can be beneficial because it improves the numerical stability of the model and often reduces training time.
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
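A minimal sketch on hypothetical one-column data of why the scaler is fitted on the training set only: the test point is transformed with the training statistics, so nothing from the validation period leaks into the features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[4.0]])

sc_demo = StandardScaler().fit(X_tr)     # mean = 2, std = sqrt(2/3)
# The test point is scaled with the *training* mean and std, so no
# information from the validation period leaks into the features.
print(sc_demo.transform(X_te).round(3))  # [[2.449]]
```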
Because the analysed dataset contains several groups of correlated variables, including the daily returns, the SMAs & EMAs and sets of the same technical indicators, it was decided to use PCA to reduce the data dimensionality. Principal Component Analysis combines correlated variables into a smaller set of linearly uncorrelated features. By doing so, it removes noise in the data and reduces model training time.
To select the number of components which gives the best model performance, 20 logistic regression models with different numbers of principal components as inputs were fitted. To assess model performance, the mean 5-fold cross-validation accuracy was calculated. The results are shown on the plot below. Based on that information, 2 principal components were selected, since the cross-validation accuracy exceeded 50% only for this number.
seed = 123
components_n = [int(x) for x in np.linspace(1, 20, 20, endpoint=True)]
lr = LogisticRegression(solver='liblinear', random_state=seed)
k_folds = KFold(n_splits=5)
test_acc = []
for n in components_n:
pca = PCA(n_components=n)
X_train_pca = pca.fit_transform(X_train_sc)
scores = cross_val_score(lr, X_train_pca, y_train, cv=k_folds)
test_acc.append(scores.mean())
pca_data = pd.DataFrame({"Components": components_n, "Cross-validation accuracy": test_acc})
fig, axs = plt.subplots(1, 1, figsize=(12, 6))
ax = sns.lineplot(data=pca_data, x="Components", y="Cross-validation accuracy", linewidth=2)
ax.xaxis.set_major_locator(ticker.MultipleLocator(2))
plt.show()
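A complementary diagnostic, sketched on synthetic data rather than the project features, is the cumulative explained variance ratio: correlated columns collapse into a single dominant component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
# Two nearly identical (highly correlated) columns plus three noise columns.
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base,
                    base + 0.01 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 3))])

pca_demo = PCA().fit(X_demo)
# The first component absorbs both correlated columns, so its explained
# variance ratio is roughly twice that of any single noise direction.
print(np.cumsum(pca_demo.explained_variance_ratio_).round(2))
```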
Logistic regression is a parametric method, meaning that it assumes a specific functional form of the model. In logistic regression, a logit transformation is applied to the odds, i.e. the ratio of the probability of success to the probability of failure.
To predict open-to-open price movements of the commodity on the 2nd market, a logistic regression model was built using the 2 principal components as independent variables. The model accuracy on the test set is 0.38, meaning the model predicts only 38% of price movements correctly; a random classifier would perform better. The precision of 0.5 says that when the model predicts an increase in the opening price, it is correct 50% of the time. At the same time, the recall of 0.097 means the model correctly identifies only 9.7% of all price growths. Specificity indicates that 84.2% of price drops were correctly identified by the model. The F1 score, which combines precision and recall, equals 16.2%.
def model_perf_report(y, y_hat):
acc = accuracy_score(y, y_hat)
cf_matrix = confusion_matrix(y, y_hat)
tn, fp, fn, tp = cf_matrix.ravel()
precision = precision_score(y, y_hat)
recall = recall_score(y, y_hat)
specificity = tn / (tn + fp)
f1 = f1_score(y, y_hat)
print(f'Accuracy = {round(acc, 3)}',
f'Precision = {round(precision, 3)}',
f'Recall = {round(recall, 3)}',
f'Specificity = {round(specificity, 3)}',
f'F1 = {round(f1, 3)}', sep='\n')
pca = PCA(n_components = 2)
X_train_pca = pca.fit_transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)
classifier_pca = LogisticRegression(random_state = seed)
classifier_pca.fit(X_train_pca, y_train)
y_pred_lr_pca = classifier_pca.predict(X_test_pca)
model_perf_report(y_test, y_pred_lr_pca)
Accuracy = 0.38 Precision = 0.5 Recall = 0.097 Specificity = 0.842 F1 = 0.162
cf_matrix = confusion_matrix(y_test, y_pred_lr_pca)
group_names = ['TN','FP','FN','TP']
group_counts = [round(value, 3) for value in cf_matrix.flatten()/cf_matrix.sum()]
labels = [f'{v1}\n\n{v2}' for v1, v2 in zip(group_names, group_counts)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.show()
To benchmark the logistic regression model trained on 2 principal components, a model without PCA was created. The second model's accuracy on the test set is 0.44. This is a slight improvement, but the model is still worse than a random classifier. The precision of 1 says that when the model predicts an upward move of the opening price, it is correct 100% of the time. At the same time, the recall of 0.097 means the model correctly identifies only 9.7% of all price growths. Specificity indicates that 100% of price drops were correctly identified by the model. The F1 score, which combines precision and recall, equals 17.6%. All these results indicate that the model is rather conservative and predicts drops most of the time. Moreover, using all the available features improves the model's classification ability.
classifier = LogisticRegression(random_state = seed)
classifier.fit(X_train_sc, y_train)
y_pred_lr = classifier.predict(X_test_sc)
model_perf_report(y_test, y_pred_lr)
Accuracy = 0.44 Precision = 1.0 Recall = 0.097 Specificity = 1.0 F1 = 0.176
cf_matrix = confusion_matrix(y_test, y_pred_lr)
group_names = ['TN','FP','FN','TP']
group_counts = [round(value, 3) for value in cf_matrix.flatten()/cf_matrix.sum()]
labels = [f'{v1}\n\n{v2}' for v1, v2 in zip(group_names, group_counts)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.show()
It was decided to build another, non-parametric model - a random forest classifier - to check whether it would outperform the parametric logistic regression. The random forest was trained on the whole, non-standardized training set, as this algorithm is robust to features with different scales and has a built-in mechanism for indicating important predictors without favouring the strongest ones.
The model uses multiple decision trees and makes a decision on the basis of a majority vote. When constructing each node of a decision tree, only a small subset of all features is considered. This prevents the model from using the same, strongest predictors when building every tree. As a result, many decorrelated decision trees are created, whose predictions are aggregated to obtain the final classification decision.
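The voting step can be sketched as follows (on synthetic data, not the project dataset). Note that scikit-learn's `predict()` actually averages the per-tree class probabilities (soft voting), which in practice usually coincides with the hard majority vote collected below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary-classification data standing in for the project features
X, y = make_classification(n_samples=200, n_features=10, random_state=123)
rf = RandomForestClassifier(n_estimators=25, max_features='sqrt',
                            random_state=123).fit(X, y)

# Each fitted tree casts its own class vote for the first observation
votes = [int(tree.predict(X[:1])[0]) for tree in rf.estimators_]
majority = int(np.bincount(votes).argmax())
print(majority, rf.predict(X[:1])[0])
```

The `max_features='sqrt'` argument is what restricts each split to a random feature subset, producing the decorrelated trees described above.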
Before training the model, hyperparameter tuning was conducted to select the combination of model parameters giving the lowest cross-validation error rate. It was done using randomized search, which searches through random values drawn from a defined parameter space. The method indicated the following set of optimal hyperparameters:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(50, 500, 10, endpoint=True)]
# Number of features to consider at every split
max_features = ['log2', 'sqrt']
# Max number of levels in tree
max_depth = [int(x) for x in np.linspace(2, 7, 6, endpoint=True)]
# Min number of observations allowed in a leaf node
min_samples_leaf = [int(x) for x in np.linspace(1, 20, 5, endpoint=True)]
# Min number of observations in a node before splitting the node
min_samples_split = [int(x) for x in np.linspace(5, 50, 10, endpoint=True)]
param_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_leaf': min_samples_leaf,
'min_samples_split': min_samples_split}
rf = RandomForestClassifier(random_state=seed)
rf_cv = RandomizedSearchCV(rf, param_grid,
cv=5, verbose=2, n_iter = 40,
random_state=seed, n_jobs = -1)
rf_cv.fit(X_train, y_train)
rf_cv.best_params_
Fitting 5 folds for each of 40 candidates, totalling 200 fits
{'n_estimators': 100,
'min_samples_split': 10,
'min_samples_leaf': 5,
'max_features': 'log2',
'max_depth': 2}
The trained random forest classifier gives similar accuracy to the logistic regression. The precision score equals 100%, recall equals 3.2%, and specificity equals 100%. This means that 100% of price movements predicted by the classifier to go up actually went up. The recall value says that out of all price growths, only 3.2% were correctly forecasted. Specificity indicates that 100% of price drops were assigned to the correct class. The F1 score, which measures the overall performance combining precision and recall, equals 6.2%.
rf = rf_cv.best_estimator_
rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=2, max_features='log2', min_samples_leaf=5,
                       min_samples_split=10, random_state=123)
y_train_pred = rf.predict(X_train)
y_pred_rf = rf.predict(X_test)
model_perf_report(y_test, y_pred_rf)
Accuracy = 0.4 Precision = 1.0 Recall = 0.032 Specificity = 1.0 F1 = 0.062
cf_matrix = confusion_matrix(y_test, y_pred_rf)
group_names = ['TN','FP','FN','TP']
group_counts = [round(value, 3) for value in cf_matrix.flatten()/cf_matrix.sum()]
labels = [f'{v1}\n\n{v2}' for v1, v2 in zip(group_names, group_counts)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
plt.show()
An important insight we can gain from the random forest is which features turned out to be the most important in determining the direction of the stock movement. We can see that the daily returns of the price on the 1st and 2nd markets contribute most to the model's prediction of the future movement direction. The remaining important features include the RSI on the 1st market, the volume on the 3rd market, and the open interest of the commodity on the 3rd market.
feature_names = X_test.columns
importances = rf.feature_importances_
imp_df = {'feature_names':feature_names, 'feature_importance': importances}
imp_df = pd.DataFrame(imp_df)
imp_df['feature_names'] = imp_df['feature_names'].map(cols_dict)
#Sort the DataFrame in order decreasing feature importance
imp_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
imp_df = imp_df[imp_df['feature_importance'] > 0.04]
#Define size of bar plot
plt.figure(figsize=(20,5))
#Plot Searborn bar chart
ax = sns.barplot(data=imp_df, y='feature_importance', x='feature_names', palette=sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=False))
ax.set_ylabel("Feature importances")
ax.set_xlabel("Feature names")
plt.show()
The plot below makes it easy to compare the models in terms of their predictive abilities. Logistic regression without feature reduction obtained the best results of all 3 variants, but they are unpromising anyway. Each model has an accuracy lower than 50% on the test set, meaning they are worse than a random classifier. Although they are bad at predicting stock price growth, they are conservative and usually predict price drops. Because of that, they are characterized by high specificity and precision. Nevertheless, none of the models should be recommended as a tool for making investment decisions, as they show a very poor ability to predict price growths.
# https://www.python-graph-gallery.com/radar-chart/
categories = ['Accuracy', 'Recall', 'Precision', 'Specificity', 'F1']
N = len(categories)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]
plt.figure(figsize=(10, 10))
ax = plt.subplot(111, polar=True)
ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)
plt.xticks(angles[:-1], categories, size=15)
ax.set_rlabel_position(0)
plt.yticks([0.2, 0.4, 0.6, 0.8], color="grey", size=12)
plt.ylim(0, 1)
cf_matrix = confusion_matrix(y_test, y_pred_rf)
tn, fp, fn, tp = cf_matrix.ravel()
values = [accuracy_score(y_test, y_pred_rf),
recall_score(y_test, y_pred_rf),
precision_score(y_test, y_pred_rf),
tn / (tn + fp),
f1_score(y_test, y_pred_rf)]
values += values[:1]
ax.plot(angles, values, linewidth=1.1, linestyle='solid', label="Random Forest")
cf_matrix = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = cf_matrix.ravel()
values = [accuracy_score(y_test, y_pred_lr),
recall_score(y_test, y_pred_lr),
precision_score(y_test, y_pred_lr),
tn / (tn + fp),
f1_score(y_test, y_pred_lr)]
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label="Logistic Regression without PCA")
cf_matrix = confusion_matrix(y_test, y_pred_lr_pca)
tn, fp, fn, tp = cf_matrix.ravel()
values = [accuracy_score(y_test, y_pred_lr_pca),
recall_score(y_test, y_pred_lr_pca),
precision_score(y_test, y_pred_lr_pca),
tn / (tn + fp),
f1_score(y_test, y_pred_lr_pca)]
values += values[:1]
ax.plot(angles, values, linewidth=1, linestyle='solid', label="Logistic Regression with PCA")
plt.title('Model performance comparison', size=20, y=1.05)
plt.legend(prop={'size': 15})
plt.show()
The aim of this report was to introduce and compare machine learning models meant to forecast the next day's open-to-open price movement based on the provided OHLC data. To do so, many technical indices were estimated, including daily returns, simple and exponential moving averages, and the relative strength index. The first model, logistic regression, was trained on 2 variables resulting from performing PCA on the whole training dataset. The model obtained a test accuracy of 38%, precision of 50%, and recall of 9.7%. The second model was logistic regression without feature reduction. This model turned out to be the best of all 3 models in terms of the highest F1 score. However, the classifier still performs very poorly, predicting only 44% of price movements accurately. The third model was a non-parametric random forest classifier, which obtained 40% accuracy, 100% precision, and recall of 3.2%. The algorithm also indicates that the most powerful predictors for price movement prediction are the daily returns and the daily volume recorded on the 3rd market.
The poor performance of the analysed classifiers shows that predicting stock price movements is a difficult task. The lack of correlation between past and current stock price movements suggests that the stock market is driven by unpredictable forces. Obviously, the presented models should not be recommended as a tool to help investors in their financial decisions, as they are only able to predict some of the drops. The classifiers are incapable of predicting upward movements, and their overall performance is worse than that of a naive random classifier. This project also indicates that including technical indices does not improve the models' predictive power.